ABSTRACT

Most educational measurement texts distinguish 
between norm-referenced (NR), or relative, methods of assigning 
letter grades to objective test scores, and criterion-referenced 
(CR), or absolute, methods. Both NR and CR approaches have serious 
limitations in typical classroom situations, and neither approach, in 
its pure form, may be entirely suitable. An alternative method is 
proposed and illustrated with scores from 57 secondary school 
students taking a 26-item objectively scored test. The approach 
involved using a smoothed or fitted cumulative distribution and a 
ratio of standard errors to fix the slope of the line through the 
ideal cut-points. This is a modification of the method of C. H. Beuk 
(1984). The rationale for this type of compromise is that it 
acknowledges the sample status of both the set of test items and the 
group of examinees and shares sampling error equally between NR and 
CR methods. The algorithm has been programmed in PASCAL for the 
microcomputer. A structured grading method of this sort would allow 
teachers of multiple sections or those within the same department to 
give somewhat comparable grades to their students if they used 
agreed-on NR standards and individual CR standards. This compromise 
would be especially useful when an entirely new test is used or an 
unfamiliar group of students is encountered. (Contains 1 table, 2 
figures, and 10 references.) (SLD) 
Objectives

Methods for assigning letter grades to a set of objective test scores would seem to be a somewhat neglected area of technical
concern. Most educational measurement texts distinguish between norm-referenced (NR) or relative methods and domain/criterion-referenced
(CR) or absolute methods. However, both NR and CR approaches can be seen to have serious limitations for use in typical classroom situations.

In a rather comprehensive examination, at a relatively low cognitive level (for example, in mastery learning), and with relatively few
students, absolute standards such as percentage-correct seem more appropriate in the sense of yielding interpretable scores (e.g., percent
mastery of the content domain). If there are many students, testing is at a higher cognitive level, and the test is composed of a less than
comprehensive set of items, then relative standards (such as z-scores) might seem more interpretable (e.g., percentile rank in the
population). That is, if the sample of students is large enough to be representative of the population and to yield accurate (percentile)
estimates of each student's relative position in the population, then NR may be a viable approach to grading. Conversely, if the content
domain is clearly defined (factual) as opposed to implied (higher level skills) and the sample of items is large enough to accurately reflect
the content domain and to yield accurate estimates of each student's score (proportion of items correct), then CR may be an appropriate
grading method.

In a usual classroom situation, there might be 20-60 examinees with 20-60 test items at a variety of cognitive levels and therefore
neither approach, in its pure form, seems particularly well-suited. Recognizing these and other factors, Terwilliger (1989) recommends using
CR for some grading decisions and NR for others.

A compromise between NR and CR seems both reasonable and consistent with current practice. It is consistent in that many teachers use
absolute standards in the form of percent-correct (sometimes because of school or district policy), but then 'adjust' the raw scores in a
variety of ways if the distribution of grades seems inappropriate or improbable. Indeed, some teachers relying solely on NR grading are quick
to admit that they also have CR 'limits' and will, for example, not award an 'A' to any score below a certain percentage-of-items-correct. A
compromise is reasonable since a valid interpretation of a CR or an NR grade requires knowledge of either the content domain or the population
of students, respectively. The blending of the two approaches might better reflect the actual partial knowledge of both the content domain
and population by the typical consumer of the grade. Indeed, grades are often seen to reflect, to some extent, both absolute and relative
achievement.

From the responses of students in my measurement classes, the most popular procedures for adjusting the scores would seem to be
'gapping' or 'eyeballing' (Wainer & Schacht, 1978), adding a fixed number (or percentage) of points to everyone's score, dropping items, or
simply making sure the next test results in grades with a compensating distribution. It is somewhat ironic that if both the mean and standard
deviation of a domain-referenced test are adjusted using a linear transformation then we have the equivalent of a most prevalent form of
norm-referencing, the z-score. The focus of this paper is on an alternative method of adjustment.

Theoretical Framework and Example with Real Data

Hofstee (1983) has suggested using a cumulative frequency distribution to better see the relationship between NR and CR
decision-making. While Hofstee did this in reference to large-scale testing, some of the principles involved apply equally well to the
classroom situation. Figure 1 is an example of a cumulative number-correct frequency distribution. The scores are from 57 secondary
students taking a 26-item objectively-scored test. A score of 15 would be approximately at the 45th percentile. Since there were 5 grading
categories (A, B, C, D, F), we can identify the expected outcomes by locating points 1-4 using both the absolute standards we have set and our
past grading practice with this unit of study. That is, if we have observed, over many sections, that 16% of the students received A's, 23%
received B's, 18% received C's, 20% received D's, and 23% received F's, then the NR standards may be seen on the vertical axis where the
cumulative proportion of students below each number-correct score is given. If it is felt that 93% or more of the items must be answered
correctly to receive an A, 85% or more to receive a B, 75% or more to receive a C, and 65% or more to receive a D, then the corresponding
number-correct standards can be seen on the horizontal axis. The intersections of the expected cut-points (1-4) are such that both the NR and
CR expectations are simultaneously met if and only if the observed distribution passes through these points. That is, if these intersections
are on the observed cumulative distribution, then we are done; if not, some decision or compromise is necessary. In a sense, the
intersections are points on our best estimate of a population cumulative distribution.
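The location of the four ideal points can be sketched as follows. This is only an illustration built from the percentages quoted above; the function and variable names are hypothetical, and numbering the points from the lowest cut upward is an assumption (it is consistent with the later use of point 3 for the B/C cut).

```python
# Test length and standards taken from the example in the text.
N_ITEMS = 26

# Expected share of students in each grade (NR standards), lowest grade first.
nr_shares = {"F": 0.23, "D": 0.20, "C": 0.18, "B": 0.23, "A": 0.16}

# Minimum proportion of items correct for each grade (CR standards).
cr_minimums = {"D": 0.65, "C": 0.75, "B": 0.85, "A": 0.93}


def ideal_cut_points(nr_shares, cr_minimums, n_items):
    """Return [(grade, number_correct_cut, cumulative_proportion_below), ...]."""
    points = []
    cumulative = 0.0
    grades = list(nr_shares)  # insertion order: lowest grade first
    for lower, upper in zip(grades, grades[1:]):
        cumulative += nr_shares[lower]          # students expected below `upper`
        x_cut = cr_minimums[upper] * n_items    # CR standard on the number-correct scale
        points.append((upper, round(x_cut, 2), round(cumulative, 2)))
    return points


points = ideal_cut_points(nr_shares, cr_minimums, N_ITEMS)
for number, (grade, x_cut, y_cut) in enumerate(points, start=1):
    print(f"point {number}: lowest {grade} at x = {x_cut}, y = {y_cut}")
```

Each point pairs a CR standard (horizontal axis) with the cumulative NR proportion expected below that grade (vertical axis).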

The particular compromise suggested by Hofstee involves setting minimum and maximum acceptable percentages-correct about each
CR cut-point and minimum and maximum acceptable proportions of students in each grading category about each NR cut-point to determine
the slope of a diagonal line through the meeting point. This line is then extended to meet the observed cumulative distribution and the
intersection is the compromise.

Beuk (1984) proposed that the compromise be obtained by using the ratio of standard deviations of the ratings of a group of judges
as the slope of the diagonal. De Gruijter (1985) uses estimates of the uncertainties concerning both the NR and CR ideals to define a family
of ellipses, selects the tangent ellipse, and uses the abscissa of the intersection as the compromise cut-score.

These procedures all require additional judgments and do not directly take into consideration the notion that, all else being equal,
with larger numbers of examinees and fewer items it might be reasonable to depend more heavily upon the NR criteria and vice-versa. That is,
if we have a primarily CR grading philosophy, then we would be concerned that there are a sufficient number of items to adequately represent a
clearly identified content domain and to permit accurate estimation of a student's domain score. If primarily NR, the concern would be to
have a sufficient number of students (representative of the population) in the sample to accurately estimate a student's percentile rank. In
a compromise situation, more weight might be given to the more accurate estimation at each decision level.

An additional problem is encountered using the observed cumulative distribution. As the ratio of test length to number of students
increases, there will be more and larger gaps or zero frequencies in the frequency distribution and these will be seen as 'flat spots' in the
cumulative distribution. Due to these random gaps and other sample fluctuations, the need to smooth the observed distribution arises. While
there are a number of smoothing approaches available, the beta-binomial (or negative hypergeometric) model has been found to be a most
efficient presmoother for equipercentile equating (Fairbank, 1987) and has also been successfully used to model number-correct achievement
test score data (Duncan, 1974; Keats & Lord, 1962). Lord and Novick (1968) recommend this model for fitting observed distributions of
number-correct scores and provide a theoretical rationale for the model. A convenient algorithm for computing the beta-binomial is available
(Huynh, 1979).
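As a rough illustration of beta-binomial smoothing, the sketch below fits the two parameters by the method of moments from the mean and KR-21 reported for the example data (15.47 and 0.66). It is not necessarily the Huynh (1979) algorithm; the function names are hypothetical, and tabulating the cumulative proportion at or below each score is an assumption about how the smoothed curve is laid out.

```python
import math


def fit_beta_binomial(mean, kr21, n_items):
    """Method-of-moments (alpha, beta) matching the mean and KR-21."""
    scale = 1.0 / kr21 - 1.0
    return mean * scale, (n_items - mean) * scale


def log_beta(a, b):
    """Natural log of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_pmf(x, n_items, alpha, beta):
    """Probability of exactly x items correct out of n_items."""
    log_choose = (math.lgamma(n_items + 1) - math.lgamma(x + 1)
                  - math.lgamma(n_items - x + 1))
    return math.exp(log_choose + log_beta(x + alpha, n_items - x + beta)
                    - log_beta(alpha, beta))


def smoothed_cumulative(n_items, alpha, beta):
    """Cumulative proportion at or below each score 0..n_items."""
    total, out = 0.0, []
    for x in range(n_items + 1):
        total += beta_binomial_pmf(x, n_items, alpha, beta)
        out.append(total)
    return out


alpha, beta = fit_beta_binomial(mean=15.47, kr21=0.66, n_items=26)
pmf = [beta_binomial_pmf(x, 26, alpha, beta) for x in range(27)]
cdf = smoothed_cumulative(26, alpha, beta)
fitted_mean = sum(x * p for x, p in enumerate(pmf))
```

Unlike the observed distribution, the fitted curve has no zero-frequency scores, so its cumulative form has no flat spots.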



Method 

The approach followed was to use a smoothed or fitted cumulative distribution and to use a ratio of standard errors to fix the slope
of the line through the ideal cut-points. This is a modification of Beuk's method in that his ratio of standard deviations is also a ratio of
standard errors (the same judges are used to provide both standard deviations). It is important to note, however, that there is no variation
in our ideal cut-points; these points may be thought of as being akin to population parameters.

If we conceptually fix a CR standard and the test, then each sample of students or class from our (assumed infinite) population of
students will yield a sample proportion of students at or below this CR standard and this sample proportion may be compared to the
hypothesized (population) proportion. The standard error of such proportions is given by $(\pi_n(1-\pi_n)/n)^{0.5}$ where $n$ is the number of
examinees and $\pi_n$ is the hypothesized population proportion of students at or below the CR standard. In exactly the same way, we may
imagine a single NR standard (proportion of items correct = $\pi_k$) and class or group of students as fixed and compute the standard error of
the proportion of items answered correctly, $(\pi_k(1-\pi_k)/k)^{0.5}$, as if the $k$ items on our test were a sample from the (assumed
infinite) content domain. The compromise is to use the ratio of these standard errors, $[(\pi_n(1-\pi_n)/n)^{0.5}]/[(\pi_k(1-\pi_k)/k)^{0.5}]$,
as the slope of the line through the ideal points.

The rationale for this type of compromise is that it acknowledges the sample status of both the set of test items and the group of
examinees and shares the sampling error equally between methods (NR and CR). In particular, the sample of items is given the same credibility
as the sample of students in that the compromise at each decision level departs from the NR and CR standards by the same number of standard
errors.

In practice, standard errors are largely influenced by sample size and this means that when the ratio of number of items to number
of examinees is large, the tendency will be for the compromise to rely more heavily on the CR standards. When the ratio is small, the
compromise will rely more heavily on the NR standards. That is, reliance is placed on both NR and CR standards, but the compromise at each
decision point tends to proportionally favor the standard with the smaller standard error.

In the example, the CR cut-point between a grade of B and C was to be a percentage-correct score of 85%. The standard error of a
proportion of items is $(\pi_k(1-\pi_k)/k)^{0.5}$ where $k$ is the number of items on the test or, $(0.85(1-0.85)/26)^{0.5} = 0.070$. The
corresponding standard error of a proportion of persons below a grade of B is $(\pi_n(1-\pi_n)/n)^{0.5}$ where $n$ is the number of examinees
or, $(0.39(1-0.39)/57)^{0.5} = 0.065$. The resulting ratio of 0.065/0.070 = 0.923 would be negated, converted to the number-correct scale, and
used as the slope of the line through point 3 (see Figure 2). Linear interpolation is then used with the smoothed cumulative distribution and
the resulting abscissa of the point of intersection (b) is 18.047. This is the suggested compromise B/C cut-score shown in Figure 2 for
grading purposes with this test given the CR and NR parameters. The calculations for the other three cut-points are similar. Note that the
slopes are all less than 1 for this example. This is the result of somewhat greater reliance on the NR standards than on the CR standards in
arriving at the compromise since there were 57 students and 26 test items. The ratio of standard errors, however, is not just the ratio of
number of persons to number of items, but also reflects the expected proportions.
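The arithmetic for the B/C cut-point can be checked with a few lines of code. This is a minimal sketch using only the figures quoted in the text; the helper name is hypothetical, and dividing by the number of items to reach the number-correct scale is an assumption about what "converted to the number-correct scale" means.

```python
import math


def se_proportion(p, size):
    """Standard error of a proportion p estimated from `size` sampled units."""
    return math.sqrt(p * (1.0 - p) / size)


N_ITEMS, N_EXAMINEES = 26, 57

se_items = se_proportion(0.85, N_ITEMS)        # CR standard, items sampled
se_persons = se_proportion(0.39, N_EXAMINEES)  # NR standard, students sampled
ratio = se_persons / se_items

# Assumption: the horizontal axis runs from 0 to N_ITEMS rather than 0 to 1,
# so the negated ratio is divided by the number of items.
slope_per_item = -ratio / N_ITEMS

print(f"SE(items)   = {se_items:.3f}")
print(f"SE(persons) = {se_persons:.3f}")
print(f"ratio       = {ratio:.3f}")
```

The ratio printed here is computed from the unrounded standard errors, which is why it matches the reported 0.923 and not the quotient of the two rounded values.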

By using a constant times the ratio of standard errors, we could adjust these cut-scores to yield any desired weighting of NR-CR
standards, perhaps to better reflect the cognitive level of the majority of the test items. It is interesting to note that setting the
constant (and hence the slope) to a value near zero results in an equipercentile equating of the smoothed observed score distribution to the
'norming group' distribution defined by the ideal points. This would seem to be a reasoned method of relative grading if the ideal points
were derived from data over many sections.
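The intersection step and the effect of the weighting constant can be sketched as below. The function is a hypothetical illustration, not the author's Pascal program: it treats the cumulative distribution as piecewise linear between tabulated scores and returns the abscissa where the line through an ideal point meets it. The three-point distribution in the usage lines is made-up toy data chosen so the result can be checked by hand.

```python
def compromise_cut(cdf_points, ideal_x, ideal_y, slope, weight=1.0):
    """Abscissa where a line through the ideal point meets a piecewise-linear CDF.

    cdf_points: (score, cumulative proportion) pairs sorted by score.
    slope:      negated ratio of standard errors on the number-correct scale.
    weight:     hypothetical multiplier on the slope; 0 gives a horizontal line.
    """
    m = weight * slope

    def gap(x, y):
        # Vertical distance from the line to the CDF at score x.
        return y - (ideal_y + m * (x - ideal_x))

    for (x0, y0), (x1, y1) in zip(cdf_points, cdf_points[1:]):
        g0, g1 = gap(x0, y0), gap(x1, y1)
        if g0 == 0.0:
            return x0
        if g0 * g1 <= 0.0:
            # The line crosses (or touches) the CDF inside this segment.
            return x0 + (x1 - x0) * g0 / (g0 - g1)
    return None  # no intersection within the tabulated scores


# Toy data for illustration only (not the distribution from the example).
toy_cdf = [(0, 0.0), (10, 0.5), (20, 1.0)]
weighted = compromise_cut(toy_cdf, ideal_x=15, ideal_y=0.5, slope=-0.05)
equipercentile = compromise_cut(toy_cdf, ideal_x=15, ideal_y=0.5,
                                slope=-0.05, weight=0.0)
```

With the weight at zero the line is horizontal, so the cut-score is simply the score whose cumulative proportion equals the NR standard, which is the equipercentile case described above.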

To make the scores more understandable to students and others, it may be desirable to follow the popular practice of presenting the
results as adjusted raw scores or adjusted percentages that can then be compared to the stated standards for letter grade decisions. The
scores, percentages, adjusted scores, and adjusted percentages are shown in Table 1. Letter grades for this example are also shown in Table 1
where the NR letter grades were calculated using z-scores with cut-scores that reflect the expected percentages of A's, B's, and so on. The
suggested or compromise letter grades are in the last column labeled NR/CR. Note how the NR/CR grades mediate the NR and CR grades somewhat
differently at each score level. This procedure is not equivalent to a simple 'averaging' of NR and CR grades.
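The NR column of Table 1 can be approximated with the sketch below. It assumes the z cut-scores are standard normal quantiles of the cumulative NR proportions, which the text does not state explicitly; the mean and standard deviation are the values reported for the example data, and the function name is hypothetical.

```python
from statistics import NormalDist

MEAN, SD = 15.47, 4.16  # reported for the example data

# Cumulative proportion of students expected below each grade (NR standards).
cumulative_below = {"D": 0.23, "C": 0.43, "B": 0.61, "A": 0.84}

# Assumption: z cut-scores are standard normal quantiles of those proportions.
z_cuts = {grade: NormalDist().inv_cdf(p) for grade, p in cumulative_below.items()}


def nr_grade(raw_score):
    """Letter grade from the z-score of a raw number-correct score."""
    z = (raw_score - MEAN) / SD
    grade = "F"
    for letter in ("D", "C", "B", "A"):  # ascending cut-scores
        if z >= z_cuts[letter]:
            grade = letter
    return grade


nr_grades = {x: nr_grade(x) for x in (23, 20, 19, 17, 16, 15, 14, 13, 12)}
```

Under this assumption the grades agree with the NR column of Table 1 for the scores listed.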

For the example data, the mean number-correct score is 15.47, the standard deviation is 4.16, and the reliability (KR-21) is 0.66.
Using a Kolmogorov-Smirnov one-sample test of fit, the maximum absolute difference is 0.095. The null hypothesis (model fits the data) is
accepted at p = 0.985. The beta-binomial has successfully fit (conservatively, at α = 0.20) over 95% of real data sets so far investigated
and has fit 100% of the author's classroom data for the past two years.
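The descriptive statistics can be recomputed from the frequencies in Table 1. The sketch below uses the usual KR-21 formula; using the n - 1 (sample) variance is an assumption, chosen because it reproduces the reported standard deviation.

```python
import math

N_ITEMS = 26

# Raw number-correct score -> frequency, copied from Table 1.
freq = {23: 2, 22: 3, 21: 2, 20: 2, 19: 6, 18: 2, 17: 8, 16: 6, 15: 3,
        14: 5, 13: 5, 12: 2, 11: 2, 10: 5, 9: 1, 8: 2, 5: 1}

n = sum(freq.values())
mean = sum(x * f for x, f in freq.items()) / n

# Assumption: sample variance with n - 1 in the denominator.
variance = sum(f * (x - mean) ** 2 for x, f in freq.items()) / (n - 1)
sd = math.sqrt(variance)

# KR-21 reliability from the test length, mean, and variance.
kr21 = (N_ITEMS / (N_ITEMS - 1)) * (1 - mean * (N_ITEMS - mean) / (N_ITEMS * variance))

print(f"n = {n}, mean = {mean:.2f}, SD = {sd:.2f}, KR-21 = {kr21:.2f}")
```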

The algorithm has been programmed in (standard) Pascal for the IBM microcomputer. There are several additional outputs, the program
can be run in batch mode or interactively, and there is an accompanying document. It is available without cost from the author when the
request is accompanied by a formatted disk and stamped mailer.

Conclusions and Educational Importance

Grades are important: they are the coin-of-the-realm in education. Many teachers find the task of evaluation difficult and might
welcome a structured method for obtaining, at least, suggested letter grades in those situations where adherence to absolute standards would
result in an unacceptable distribution of letter grades.

Continuing to adjust proportion-correct standards by ad hoc methods is neither reasoned nor reliable. Structured grading methods
such as this would also permit teachers of multiple sections or those within the same department to give somewhat comparable grades to their
students if they used agreed-upon NR standards and individual CR standards that reflect professional judgement about differences in the
difficulties and objectives of their individual tests. The use of a computer to assist in grading decisions means that practical and useable
approaches need not be overly simplistic. This compromise might prove most useful when an entirely new test is used or when an unfamiliar
group of students is encountered. When a teacher is obliged to adhere to grading standards as in the example (93% and above for an 'A'),
giving tests that challenge all students and that reflect higher level cognitive skills becomes virtually impossible without some means of
score adjustment. 'Eyeballing' a set of scores is simply not good grading practice.
Figure 1 Cumulative distribution of the observed number-correct raw scores.
Figure 2 Cumulative distribution of the smoothed number-correct raw scores.




Table 1 Example Data With NR, CR, and Compromise Grades

  X   Freq     %    Adj. X   Adj. %   z-score   NR   CR   NR/CR
 23     2    88.5   24.81    95.44     1.808    A    B     A
 22     3    84.6   24.42    93.91     1.568    A    C     A
 21     2    80.8   23.93    92.05     1.328    A    C     B
 20     2    76.9   23.31    89.66     1.087    A    C     B
 19     6    73.1   22.69    87.27     0.847    B    D     B
 18     2    69.2   22.05    84.80     0.607    B    D     C
 17     8    65.4   20.97    80.67     0.367    B    D     C
 16     6    61.5   19.90    76.53     0.126    C    F     C
 15     3    57.7   18.87    72.57    -0.114    C    F     D
 14     5    53.8   17.87    68.72    -0.354    D    F     D
 13     5    50.0   16.86    64.84    -0.594    D    F     F
 12     2    46.2   15.56    59.85    -0.835    F    F     F
 11     2    42.3   14.26    54.86    -1.075    F    F     F
 10     5    38.5   12.97    49.87    -1.315    F    F     F
  9     1    34.6   11.67    44.89    -1.555    F    F     F
  8     2    30.8   10.37    39.90    -1.796    F    F     F
  5     1    19.2    6.48    24.94    -2.516    F    F     F

Note. X is the raw number-correct score; NR/CR is the compromise grade.